{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "W3BWz6766jYQ"
   },
   "source": [
    "# Prediction "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pvKWP97u6jYY"
   },
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8-yiVTRl6jYY"
   },
   "source": [
    "In labor economics an important question is what determines the wage of workers. This is a causal question, but we could begin to investigate from a predictive perspective.\n",
    "\n",
    "In the following wage example,  𝑌  is the hourly wage of a worker and  𝑋  is a vector of worker's characteristics, e.g., education, experience, sex. Two main questions here are:\n",
    "\n",
    "* How to use job-relevant characteristics, such as education and experience, to best predict wages?\n",
    "\n",
    "* What is the difference in predicted wages between male and female workers with the same job-relevant characteristics?\n",
    "\n",
    "In this lab, we focus on the prediction question first.\n",
    "\n",
    "## Data\n",
    "\n",
    "The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below  3 .\n",
    "\n",
    "The variable of interest  𝑌  is the hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size  $n=5150$ ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qb_bGht76jYb"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import statsmodels.api as sm\n",
    "import sklearn.linear_model as lm\n",
    "import sklearn.metrics as metrics\n",
    "import statsmodels.formula.api as smf\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.model_selection import cross_val_score, train_test_split\n",
    "from sklearn.metrics import r2_score\n",
    "import warnings\n",
    "# ignore potential convergence warnings; for some small penalty levels, tried out, optimization might not converge\n",
    "warnings.simplefilter('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "20qWNJ6I6jYb"
   },
   "source": [
    "## Data Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2vf4YgrS6jYc"
   },
   "outputs": [],
   "source": [
    "file = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv\"\n",
    "df = pd.read_csv(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bi7bgBZd6jYd",
    "outputId": "0ccfe96a-a10d-47e6-8eb8-ddba57284a74"
   },
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_5hnESw-6jYf"
   },
   "source": [
    "### Construct variables\n",
    "\n",
    "We are constructing the output variable  $Y$  and the matrix  $Z$  which includes the characteristics of workers that are given in the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "UP1QSwi16jYf"
   },
   "outputs": [],
   "source": [
    "Y = np.log(df['wage'])\n",
    "Z = df.drop(['wage', 'lwage'], axis=1)\n",
    "Z.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "W38QGEM56jYf"
   },
   "source": [
    "For the outcome variable wage and a subset of the raw regressors, we calculate the empirical mean and other empirical measures to get familiar with the data.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_8ZXZCt16jYg",
    "outputId": "1b5b28ae-5734-4c51-9fe8-e01d93630eb5"
   },
   "outputs": [],
   "source": [
    "Z.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "q6Gn9T7I6jYg"
   },
   "source": [
    "E.g., the share of female workers in our sample is ~44% ( 𝑠𝑒𝑥=1  if female)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "t3CSumTv6jYg",
    "outputId": "10f28669-f5c7-40fb-e161-3b2d17ce9343"
   },
   "outputs": [],
   "source": [
    "# if you want to print this table to latex\n",
    "print(Z.describe().style.to_latex())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2vcHDz8x6jYh"
   },
   "source": [
    "## Prediction Question"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eqPNtBFR6jYh"
   },
   "source": [
    "Now, we will construct a prediction rule for hourly wage  $Y$, which depends linearly on job-relevant characteristics  $X$:\n",
    "\n",
    "$$\n",
    "𝑌=\\beta′𝑋+𝜖.\n",
    "$$\n",
    "\n",
    "Our goals are:\n",
    "\n",
    "* Predict wages using various characteristics of workers.\n",
    "\n",
    "* Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample $R^2$  and the out-of-sample MSE and  $R^2$ .\n",
    "\n",
    "We employ two different specifications for prediction:\n",
    "\n",
    "1. **Basic Model**:  $𝑋$  consists of a set of raw regressors (e.g. sex, experience, education indicators, occupation and industry indicators, regional indicators).\n",
    "2. **Flexible Model**:  $𝑋$  consists of all raw regressors from the basic model plus occupation and industry indicators, transformations (e.g.,  $exp^2$  and  $exp^3$ ) and additional two-way interactions of polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is experience times the indicator of having a college degree.\n",
    "\n",
    "Using the **Flexible Model**, enables us to approximate the real relationship by a more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but give models which are harder to interpret.\n",
    "\n",
    "Now, let us fit both models to our data by running ordinary least squares (ols):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ww9XEg9q6jYh"
   },
   "outputs": [],
   "source": [
    "# 1. Basic Model\n",
    "model_base = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2) + C(ind2)'\n",
    "base = smf.ols(model_base, data=df)\n",
    "results_base = base.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rsquared_base = results_base.rsquared\n",
    "rsquared_adj_base = results_base.rsquared_adj\n",
    "mse_base = np.mean(results_base.resid**2)\n",
    "mse_adj_base = results_base.mse_resid\n",
    "print(f'Rsquared={rsquared_base:.4f}')\n",
    "print(f'Rsquared_adjusted={rsquared_adj_base:.4f}')\n",
    "print(f'MSE={mse_base:.4f}')\n",
    "print(f'MSE_adjusted={mse_adj_base:.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# verify the formulas\n",
    "X, y = base.data.exog, base.data.endog\n",
    "n, p = X.shape\n",
    "mse = np.mean((y - results_base.predict(X, transform=False))**2)\n",
    "mse_adj = mse * n / (n - p)\n",
    "rsquared = 1 - mse / np.var(y)\n",
    "rsquared_adj = 1 - mse_adj / np.var(y)\n",
    "print(f'Rsquared={rsquared:.4f}')\n",
    "print(f'Rsquared_adjusted={rsquared_adj:.4f}')\n",
    "print(f'MSE={mse:.4f}')\n",
    "print(f'MSE_adjusted={mse_adj:.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ydQNNmzs6jYi"
   },
   "outputs": [],
   "source": [
    "# 2. Flexible Model\n",
    "model_flex = ('lwage ~ sex + shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we '\n",
    "                      '+ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)')\n",
    "flex = smf.ols(model_flex, data=df)\n",
    "results_flex = flex.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rsquared_flex = results_flex.rsquared\n",
    "rsquared_adj_flex = results_flex.rsquared_adj\n",
    "mse_flex = np.mean(results_flex.resid**2)\n",
    "mse_adj_flex = results_flex.mse_resid\n",
    "print(f'Rsquared={rsquared_flex:.4f}')\n",
    "print(f'Rsquared_adjusted={rsquared_adj_flex:.4f}')\n",
    "print(f'MSE={mse_flex:.4f}')\n",
    "print(f'MSE_adjusted={mse_adj_flex:.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2uxr3OZT6jYj"
   },
   "source": [
    "#### Next we try the Lasso\n",
    "\n",
    "We use the statsmodels package with the formula api for uniformity in feature construction and the sklearn Lasso with cross-validation to tune the regularization hyperparameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ynf9WMhT6jYj"
   },
   "outputs": [],
   "source": [
    "# Lasso with cross-validation\n",
    "X = flex.data.exog[:, 1:] # exclude the intercept; we don't want the lasso to penalize the intercept\n",
    "y = flex.data.endog\n",
    "\n",
    "# train model using Lasso with cross validation and variable normalization\n",
    "lasso = Pipeline([('scale', StandardScaler()), # standardize the variables\n",
    "                  ('lasso', lm.LassoCV())])\n",
    "lasso.fit(X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# verify the formulas\n",
    "n, p = X.shape\n",
    "p += 1\n",
    "mse_lasso = np.mean((y - lasso.predict(X))**2)\n",
    "mse_adj_lasso = mse_lasso * n / (n - p)\n",
    "rsquared_lasso = 1 - mse_lasso / np.var(y)\n",
    "rsquared_adj_lasso = 1 - mse_adj_lasso / np.var(y)\n",
    "print(f'Rsquared={rsquared_lasso:.4f}')\n",
    "print(f'Rsquared_adjusted={rsquared_adj_lasso:.4f}')\n",
    "print(f'MSE={mse_lasso:.4f}')\n",
    "print(f'MSE_adjusted={mse_adj_lasso:.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "slxv2Km-6jYm",
    "outputId": "a8edb46f-ec83-430f-b8bb-336e22b3a895"
   },
   "outputs": [],
   "source": [
    "# store the results in a table\n",
    "res_df = pd.DataFrame()\n",
    "\n",
    "res_df['Model'] = ['Basic reg', 'Flexible reg', 'Flexible Lasso']\n",
    "\n",
    "res_df['p'] = [results_base.params.shape[0],\n",
    "               results_flex.params.shape[0],\n",
    "               results_flex.params.shape[0]]\n",
    "\n",
    "res_df['R2'] = [rsquared_base, rsquared_flex, rsquared_lasso]\n",
    "res_df['MSE'] = [mse_base, mse_flex, mse_lasso]\n",
    "\n",
    "res_df['adj_R2'] = [rsquared_adj_base, rsquared_adj_flex, rsquared_adj_lasso]\n",
    "res_df['adj_MSE'] = [mse_adj_base, mse_adj_flex, mse_adj_lasso]\n",
    "\n",
    "# Show results\n",
    "res_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9mZfw8rQ6jYm",
    "outputId": "4a9c6ad8-af09-4d06-a832-47f11686c8fc"
   },
   "outputs": [],
   "source": [
    "# print to Latex\n",
    "print(res_df.style.to_latex())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "f--XOr256jYm"
   },
   "source": [
    "## Data Splitting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bZIdxq3X6jYn"
   },
   "source": [
    "Measure the prediction quality of the two models via data splitting:\n",
    "\n",
    "* Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophiscticated version of splitting that we can consider).\n",
    "* Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.\n",
    "* Use the testing sample for evaluation. Predict the  𝚠𝚊𝚐𝚎  of every observation in the testing sample based on the estimated parameters in the training sample.\n",
    "* Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-aB4UH1T6jYn"
   },
   "outputs": [],
   "source": [
    "# split the data into training and testing sets\n",
    "\n",
    "# get the indices\n",
    "n = df.shape[0]\n",
    "train_idx, test_idx = train_test_split(np.arange(n), test_size=.2, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "q9U8LYmL6jYn",
    "outputId": "f0094a73-1c2d-4525-8813-d031a1448706"
   },
   "outputs": [],
   "source": [
    "# Basic model\n",
    "\n",
    "# estimating the parameters in the training sample\n",
    "mod1 = smf.ols(model_base, data=df)\n",
    "X1 = mod1.data.exog\n",
    "y1 = mod1.data.endog\n",
    "\n",
    "# separate training and testing sets\n",
    "X1_train = X1[train_idx,:]\n",
    "X1_test = X1[test_idx,:]\n",
    "y1_train = y1[train_idx]\n",
    "y1_test = y1[test_idx]\n",
    "\n",
    "# estimating the parameters in the training sample\n",
    "regbasic = sm.OLS(y1_train, X1_train).fit()\n",
    "\n",
    "# predict out of sample\n",
    "trainregbasic = regbasic.predict(X1_test)\n",
    "\n",
    "# calculating out-of-sample MSE\n",
    "MSE_test1 = sum((y1_test-trainregbasic)**2)/y1_test.shape[0]\n",
    "R2_test1 = 1. - MSE_test1/np.var(y1_test)\n",
    "\n",
    "print(\"Test MSE for the basic model: \"+ str(MSE_test1))\n",
    "print(\"Test R2 for the basic model: \"+ str(R2_test1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "azUeNf8G6jYn"
   },
   "source": [
    "In the basic model, the  $𝑀𝑆𝐸_{𝑡𝑒𝑠𝑡}$  is quite closed to the  $𝑀𝑆𝐸_{𝑠𝑎𝑚𝑝𝑙𝑒}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "B9b9XAix6jYo",
    "outputId": "18838538-f63e-41a9-9a33-cfd7808e59e7"
   },
   "outputs": [],
   "source": [
    "# Flexible model\n",
    "\n",
    "# estimating the parameters in the training sample\n",
    "mod2 = smf.ols(model_flex, data=df)\n",
    "X2 = mod2.data.exog\n",
    "y2 = mod2.data.endog\n",
    "\n",
    "# separate training and testing sets\n",
    "X2_train = X2[train_idx,:]\n",
    "X2_test = X2[test_idx,:]\n",
    "y2_train = y2[train_idx]\n",
    "y2_test = y2[test_idx]\n",
    "\n",
    "# estimating the parameters in the training sample\n",
    "regflex = sm.OLS(y2_train, X2_train).fit()\n",
    "\n",
    "# predict out of sample\n",
    "trainregflex = regflex.predict(X2_test)\n",
    "\n",
    "# calculating out-of-sample MSE\n",
    "MSE_test2 = np.mean((y2_test - trainregflex)**2)\n",
    "R2_test2 = 1. - MSE_test2 / np.var(y2_test)\n",
    "\n",
    "print(\"Test MSE for the flexible model: \"+ str(MSE_test2))\n",
    "print(\"Test R2 for the flexible model: \"+ str(R2_test2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-EEZIiCF6jYo"
   },
   "source": [
    "In the flexible model, the discrepancy between the  $𝑀𝑆𝐸_{𝑡𝑒𝑠𝑡}$  and the  $𝑀𝑆𝐸_{𝑠𝑎𝑚𝑝𝑙𝑒}$  is not large.\n",
    "\n",
    "It is worth to notice that the  $𝑀𝑆𝐸_{𝑡𝑒𝑠𝑡}$  vary across different data splits. Hence, it is a good idea average the out-of-sample MSE over different data splits to get valid results.\n",
    "\n",
    "Nevertheless, we observe that, based on the out-of-sample  $𝑀𝑆𝐸$ , the basic model using ols regression performs is about as well (or slightly better) than the flexible model.\n",
    "\n",
    "Next, let us use lasso regression in the flexible model instead of ols regression. Lasso (least absolute shrinkage and selection operator) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors  $p$  is relatively large in relation to  $n$ .\n",
    "\n",
    "Note that the out-of-sample  𝑀𝑆𝐸  on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to ols regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "05auay496jYp",
    "outputId": "3d3d5fe8-9b53-4098-d1f2-77e9051f8c14"
   },
   "outputs": [],
   "source": [
    "# train model using Lasso with cross validation and variable normalization\n",
    "lasso = Pipeline([('scale', StandardScaler()), # standardize the variables\n",
    "                  ('lasso', lm.LassoCV())])\n",
    "lasso.fit(X2_train[:, 1:], y2_train)\n",
    "\n",
    "# predict out of sample\n",
    "trainreglasso = lasso.predict(X2_test[:, 1:])\n",
    "\n",
    "# calculating out-of-sample MSE\n",
    "MSE_test3 = np.mean((y2_test - trainreglasso)**2)\n",
    "R2_test3 = 1. - MSE_test3 / np.var(y2_test)\n",
    "\n",
    "print(\"Test MSE for the basic model: \"+ str(MSE_test3))\n",
    "print(\"Test R2 for the basic model: \"+ str(R2_test3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "uO6YxpPw6jYp",
    "outputId": "789c392a-9b4f-493c-c8ba-55200ff21412"
   },
   "outputs": [],
   "source": [
    "# store the results in a table\n",
    "res_df2 = pd.DataFrame()\n",
    "\n",
    "res_df2['Model'] = ['Basic reg', 'Flexible reg', 'Flexible Lasso']\n",
    "\n",
    "res_df2['$MSE_{test}$'] = [MSE_test1, MSE_test2, MSE_test3]\n",
    "res_df2['$R^2_{test}$'] = [R2_test1, R2_test2, R2_test3]\n",
    "\n",
    "# Show results\n",
    "res_df2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "UJUxIXcP6jYq",
    "outputId": "29cbd012-db0e-4b1b-8e48-83c3cb21337f"
   },
   "outputs": [],
   "source": [
    "# print to Latex\n",
    "print(res_df2.style.to_latex())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "PM1_prediction.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
